Contents

This vignette applies the UINMF algorithm (Kriebel et al. 2021) implemented in the LIGER package to integrate the single-cell RNA sequencing (scRNA-Seq) and the mass spectrometry-based single-cell proteomics (MS-SCP) data. This data set is provided by the SingleCellMultiModal package. The major advantage of using the recently published UINMF algorithm is that it does not require that the different modalities are acquired from the same cell and it can include features that are specific to each modality.

1 Load data

We use the SCoPE2 function from the SingleCellMultiModal package to retrieve the MS-SCP and scRNA-Seq data set. SCoPE2 refers to the name of the acquisition protocol for MS-SCP data.

library(SingleCellMultiModal)
mae <- SCoPE2("macrophage_differentiation",
              modes = "rna|protein",
              dry.run = FALSE)
## Warning in split.default(filepaths, fact): data length is not a multiple of
## split variable
mae
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 2:
##  [1] protein: SingleCellExperiment with 3042 rows and 1490 columns
##  [2] rna: SingleCellExperiment with 32738 rows and 20274 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save all data to files

The mae object contains two assays containing either the MS-SCP data (protein) or the scRNA-Seq data (rna). Refer to the SCoPE2 vignette (vignette("SCoPE2", package = "SingleCellMultiModal")) for more information about the data set.

2 RNA data Processing

We next need to prepare the data before integration. While the protein data has already been extensively processed, the RNA data has undergone only minimal processing. RNA data has already been filtered for cells that pass QC, but still need to be processed in order to select for highly variable features.

2.1 Remove empty features

First, we remove features that were not detected. A gene is detected if it has at least one count in at least one of the cells.

sel <- rowSums(counts(mae[["rna"]]) > 0) > 0
table(sel)
## sel
## FALSE  TRUE 
## 10149 22589

About 30 % of the genes (10149 out of 32738) in the data were not detected. We remove them.

mae[["rna"]] <- mae[["rna"]][sel, ]

2.2 Log-normalization

We used the scuttle package to normalization the scRNA-Seq data. This is performed using the logNormCounts. Adding transform = "none" will create a new data matrix containing the normalized counts before log-transformation.

library(scuttle)
## Loading required package: SingleCellExperiment
mae[["rna"]] <- logNormCounts(mae[["rna"]], transform = "none")
mae[["rna"]] <- logNormCounts(mae[["rna"]])

Notice that two new data assay were added to the RNA assay, called normcounts and logcounts.

assays(mae[["rna"]])
## List of length 3
## names(3): counts normcounts logcounts

3 Features selection

3.1 Highly variable genes

For the scRNA-Seq data, we select highly variable genes (HVG). Those can be identified by modelling the gene-variance relationship from the logcounts data. This is performed using the modelGeneVar from the scran package.

library(scran)
varModel <- modelGeneVar(mae[["rna"]])

This model decompose the gene variance into technical variability (fitted trend) and biological variability (residuals).

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
## ✔ tibble  3.1.4     ✔ dplyr   1.0.7
## ✔ tidyr   1.1.3     ✔ stringr 1.4.0
## ✔ readr   2.0.1     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::collapse()   masks IRanges::collapse()
## ✖ dplyr::combine()    masks Biobase::combine(), BiocGenerics::combine()
## ✖ dplyr::count()      masks matrixStats::count()
## ✖ dplyr::desc()       masks IRanges::desc()
## ✖ tidyr::expand()     masks S4Vectors::expand()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::first()      masks S4Vectors::first()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
## ✖ purrr::reduce()     masks GenomicRanges::reduce(), IRanges::reduce()
## ✖ dplyr::rename()     masks S4Vectors::rename()
## ✖ dplyr::slice()      masks IRanges::slice()
ggplot(data.frame(varModel)) +
    aes(x = mean) +
    geom_point(aes(y = total)) +
    geom_line(aes(y = tech), colour = "red") +
    ylab("variance")

We filter the 2000 genes that have highest biological variability, constraint to features with a false discover rate smaller than 0.3 (the hypotheses being the biological variability is greater than the technical variability against there are no difference between the two types of variability). We store the selection results in the data set.

topHvg <- getTopHVGs(varModel,
                     n = 2000,
                     var.field = "bio",
                     fdr.field = "FDR",
                     fdr.threshold = 0.3)
rowData(mae[["rna"]])$top2000HVG <- rownames(mae[["rna"]]) %in% topHvg

3.2 Highly variable proteins

For the MS-SCP data, we cannot model the mean-variance relationship since the data already went through protein-wise normalization. Therefore the distribution of the average protein expression is centered around zero. We will only focus on the protein variance

vars <- rowVars(assay(mae[["protein"]]))
hist(vars)

Although this approach is far from being perfect and highlight the current lack of principled MS-SCP data processing workflows, we select the top 2000 proteins that have highest variance.

thres <- sort(vars, decreasing = TRUE)[2000]
rowData(mae[["protein"]])$top2000HVP <- vars >= thres

4 Convert feature names

Before proceeding to the integration, we need to convert the features names (genes and proteins) to a common naming. We decided to convert to HGNC symbols because this type of annotation is easier to interpret from downstream analysis results.

The conversion is performed using biomaRt and relying on the Ensembl database. We use the human data base since cells in the data were isolated from human cell line cultures.

library(biomaRt)
human <- useMart("ensembl", "hsapiens_gene_ensembl")

The features names in the rna assay are already encoded as HGNC symbols and so do not require additional processing.

head(rownames(mae[["rna"]]))
## [1] "RP11-34P13.7"  "RP11-34P13.8"  "AL627309.1"    "RP11-34P13.9" 
## [5] "AP006222.2"    "RP4-669L17.10"

The feature names in the protein assay are however encoded as Uniprot protein IDs that does not match the symbols in the rna assay.

head(rownames(mae[["protein"]]))
## [1] "A0A075B6H9" "A0A0B4J1V0" "A0A0B4J237" "A0A1B0GTH6" "A0A1B0GUA6"
## [6] "A0A1B0GUU1"

We therefore convert those proteins IDs to HGNC symbols. Note that HGNC symbols are not available for all protein IDs, resulting in some gene symbols to be NA. Proteins with unassigned HGNC symbols are removed from the analysis.

protIds <- rownames(mae[["protein"]])
conversion <- getBM(attributes = c("uniprot_gn_id", "hgnc_symbol"),
                    filters = "uniprot_gn_id",
                    values = protIds,
                    mart = human,
                    uniqueRows = TRUE)
rownames(mae[["protein"]]) <-
    conversion$hgnc_symbol[match(protIds, conversion$uniprot_gn_id)]
mae[["protein"]] <- mae[["protein"]][!is.na(rownames(mae[["protein"]])), ]

Now that the feature names for both modalities are encoded using the same convention, we can check the overlap between the two assays.

all <- union(rownames(mae[["rna"]]),
             rownames(mae[["protein"]]))
table(`in rna` = all %in% rownames(mae[["rna"]]),
      `in protein` = all %in% rownames(mae[["protein"]]))
##        in protein
## in rna  FALSE  TRUE
##   FALSE     0   384
##   TRUE  20035  2520

5 Export data

Export data to csv files for processing/exploring with third-paty tools.

(The code chunks below aren’t evaluated)

write.csv(colData(mae[["rna"]]), file = "../data/rna_sample_annotation.csv")
write.csv(colData(mae[["protein"]]), file = "../data/protein_sample_annotation.csv")
write.csv(assay(mae[["rna"]]), file = "../data/rna_expression.csv")
write.csv(assay(mae[["protein"]]), file = "../data/protein_expression.csv")

6 Integrate scRNA-Seq and MS-SCP data

The integration performed in this vignette uses the UINMF algorithm described in Kriebel et al. 2021. The algorithm decomposes the data (\([E_1, P_1]\) and [E_2, P_2]) in different latent components. The underlying model is illustrated in this figure taken from the original article.

The UINMF algorithm is available from the liger package. Currently, this available on the U_algorithm branch of the GitHub repository.

## BiocManager::install("welch-lab/liger", ref = "U_algorithm")
library(liger)

6.1 Create the LIGER object

The LIGER package implements its own data structure. We therefore need to create a liger object from the mae object. Since the data are easily accessible, this conversion is straightforward to perform. Note, that we don’t use the log-transformed data but rather the data on the count scale. This is because we here assume that the biological processes can be modelled by additive effects. This assumption would not hold when integrating the data in logarithmic scale. We therefore extract the normalized counts for the scRNA-Seq data and exponentiate the MS-SCP data.

lig <- createLiger(list(rna = normcounts(mae[["rna"]]),
                        protein = 2^assay(mae[["protein"]])),
                   make.sparse = TRUE,
                   take.gene.union = FALSE,
                   remove.missing = TRUE)

Normalize data by dividing by the summed expression per cell.

lig <- liger::normalize(lig)

6.2 Define shared and unshared features

We will here tell the UINMF algorithm which features are shared across modalities and which features are specific of modality. We retrieve the genes and proteins that were retained from the feature selection procedure above.

selProts <- rownames(mae[["protein"]])[rowData(mae[["protein"]])$top2000HVP]
selRNAs <- rownames(mae[["rna"]])[rowData(mae[["rna"]])$top2000HVG]

6.2.1 Shared

The shared features are the intersection between the selected genes and the selected proteins. We provide those shared features to the liger object through the var.genes slot.

lig@var.genes <- intersect(selProts, selRNAs)
length(lig@var.genes)
## [1] 230

6.2.2 Unshared

The unshared features are the selected genes or proteins that are not part of the set of shared features. Those are stored in the var.unshared.features slot.

lig@var.unshared.features <- list(rna = selRNAs[!selRNAs %in% lig@var.genes],
                                  protein = selProts[!selProts %in% lig@var.genes])
lapply(lig@var.unshared.features, length)
## $rna
## [1] 1774
## 
## $protein
## [1] 1681

6.3 Perform integration

This section will perform the integration so to speak. We first need to scale each columns (cell) to unit variance. This is to provide the same weights to all cells during the integration.

lig <- scaleNotCenter(object = lig,
                      remove.missing = TRUE)

Next, we fit the UINMF model to the data using the optimizeALS function.

lig <- optimizeALS(lig,
                   k = 30, ## Number of latent factors
                   lambda = 5,
                   thresh = 1e-06,
                   max.iters = 30,
                   use.unshared = TRUE)
## [1] "Performing Factorization using UINMF and unshared features"
## [1] "Processing"
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |======================================================================| 100%
## Current seed  1  current objective  40782154
## Best results with seed 1.

Finally, we combine the \(H_{rna}\) and \(H_{protein}\) matrices in a single \(H\) matrix using quantile_norm function. This allows to integrate the protein and transcriptomic latent spaces in a common latent space.

lig <- quantile_norm(lig, ref_dataset = "rna")

7 Visualization of the integration

metadata(mae)$Hnorm <- lig@H.norm
metadata(mae)$W <- lig@W
metadata(mae)$V <- lig@V

We now assess the integration of the proteomic to the transcriptomic data. Each cell is characterized by 30 latent factors. For easier visualization, we reduce the factors to two dimensions by applying UMAP to the integrated \(H\) matrix. This is done internally by the runUMAP function from the liger package.

library(scater)
## 
## Attaching package: 'scater'
## The following objects are masked from 'package:liger':
## 
##     runTSNE, runUMAP
umap <- calculateUMAP(metadata(mae)$Hnorm,
                      ncomponents = 2,
                      scale = TRUE,
                      n_neighbors = 10,
                      transpose = TRUE)
rownames(umap) <- rownames(metadata(mae)$Hnorm)
metadata(mae)$UMAP <- umap

Then, using some data manipulation to extract the dimension reduction and add the modality, we can visualize the data integration.

df <- data.frame(UMAP = metadata(mae)$UMAP)
df <- cbind(df, colData(mae)[rownames(df), ])
df$Modality <- ifelse(startsWith(rownames(df), "i"), "protein", "rna")
gg <- ggplot(df) +
    aes(x = UMAP.1,
        y = UMAP.2,
        colour = Modality) +
    geom_point(size = 0.5, alpha = 0.3) +
    theme_minimal() +
    theme(legend.position = "bottom") +
    xlab("UMAP 1") +
    ylab("UMAP 2")
gg + scale_colour_manual(values = c("#7332a8", "orange2"),
                         na.value = "grey")

From this graph, we can see that protein and RNA data do overlap indicating that the cells that were analyzed do indeed come from the same experimental design.

gg + aes(colour = celltype)

8 Cluster analysis

library(scran)
library(igraph)
g <- buildSNNGraph(t(metadata(mae)$Hnorm),
                   k = 15,
                   type = "rank")
clust <- cluster_louvain(g)$membership
names(clust) <- rownames(metadata(mae)$Hnorm)
colData(mae)[names(clust), "cluster"] <- paste0("cl", clust)
df <- data.frame(UMAP = metadata(mae)$UMAP)
df <- cbind(df, colData(mae)[rownames(df), ])
df$Modality <- ifelse(startsWith(rownames(df), "i"), "protein", "rna")
ggplot(df) +
    aes(x = UMAP.1,
        y = UMAP.2,
        colour = cluster) +
    geom_point(size = 0.5, alpha = 0.3) +
    geom_label(data = . %>% group_by(cluster) %>%
                   summarize(UMAP.1 = median(UMAP.1),
                             UMAP.2 = median(UMAP.2)),
               aes(label = cluster), size = 2) +
    theme_minimal() +
    xlab("UMAP 1") +
    ylab("UMAP 2")

expRatio <- sum(df$Modality == "rna") / nrow(df)
group_by(df, cluster, Modality) %>%
    summarise(n = n()) %>%
    group_by(cluster) %>%
    mutate(p = n / sum(n)) %>%
    ggplot() +
    aes(x = cluster,
        y = p,
        fill = Modality) +
    geom_bar(stat = "identity") +
    geom_hline(yintercept = expRatio) +
    ylab("Proportion") +
    theme_minimal()
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.

randRatio <-
    sum(!is.na(df$celltype) & df$celltype == "Monocyte") / sum(!is.na(df$celltype))
filter(df, Modality == "protein") %>%
    group_by(cluster, celltype) %>%
    summarise(n = n()) %>%
    group_by(cluster) %>%
    mutate(p = n / sum(n)) %>%
    ggplot() +
    aes(x = cluster,
        y = p,
        fill = celltype) +
    geom_bar(stat = "identity") +
    geom_hline(yintercept = randRatio) +
    ylab("Proportion") +
    theme_minimal()
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.

filter(df, Modality == "protein") %>%
    group_by(cluster, batch_MS) %>%
    summarise(n = n()) %>%
    rbind(data.frame(cluster = "expected",
                     batch_MS = unique(df$batch_MS),
                     n = 1)) %>%
    group_by(cluster) %>%
    mutate(p = n / sum(n)) %>%
    ggplot() +
    aes(x = cluster,
        y = p,
        fill = batch_MS) +
    geom_bar(stat = "identity") +
    ylab("Proportion") +
    theme_minimal() +
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.

8.1 Cluster annotation

rna <- getWithColData(mae, "rna")
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
##   removing 1490 sampleMap rows not in names(experiments)
##   removing 1490 colData rownames not in sampleMap 'primary'
res <- findMarkers(rna, groups = rna$cluster,
                   pval.type = "some",
                   min.prop = 0.7,
                   assay.type = "logcounts")
markers <- c("OASL", "DDX58", "ISG15", "IFIT3", "OAS1")
plots <- lapply(markers, function(marker) {
    data.frame(exprs = logcounts(rna)[marker, ],
               UMAP = metadata(mae)$UMAP[colnames(rna), ],
               colData(rna)[colnames(rna), ]) %>%
        ggplot +
        aes(x = UMAP.1,
            y = UMAP.2,
            colour = exprs) +
        geom_point(size = 0.3, alpha = 0.3) +
        ggtitle(marker) +
        scale_color_continuous(type = "viridis") +
        theme_minimal() +
        xlab("UMAP 1") +
        ylab("UMAP 2")
})
wrap_plots(plots)

markers <- c("PLAUR", "ITGB1", "CD82", "TGFB1", "C5AR1", "PDIA3", "HSPA5", "CD44", "MMP1")
plots <- lapply(markers, function(marker) {
    data.frame(exprs = logcounts(rna)[marker, ],
               UMAP = metadata(mae)$UMAP[colnames(rna), ],
               colData(rna)[colnames(rna), ]) %>%
        ggplot +
        aes(x = UMAP.1,
            y = UMAP.2,
            colour = exprs) +
        geom_point(size = 0.3, alpha = 0.3) +
        ggtitle(marker) +
        scale_color_continuous(type = "viridis") +
        theme_minimal() +
        xlab("UMAP 1") +
        ylab("UMAP 2")
})
wrap_plots(plots)

markers <- c("MALAT1", "NEAT1", "MYO1F", "N4BP2L2")
plots <- lapply(markers, function(marker) {
    data.frame(exprs = logcounts(rna)[marker, ],
               UMAP = metadata(mae)$UMAP[colnames(rna), ],
               colData(rna)[colnames(rna), ]) %>%
        ggplot +
        aes(x = UMAP.1,
            y = UMAP.2,
            colour = exprs) +
        geom_point(size = 0.3, alpha = 0.3) +
        ggtitle(marker) +
        scale_color_continuous(type = "viridis") +
        theme_minimal() +
        xlab("UMAP 1") +
        ylab("UMAP 2")
})
wrap_plots(plots)

markers <- c("S100A8", "S100A9", "ALOX5AP", "TREM2", "TYROBP", "FTH1", "LSP1")
plots <- lapply(markers, function(marker) {
    data.frame(exprs = logcounts(rna)[marker, ],
               UMAP = metadata(mae)$UMAP[colnames(rna), ],
               colData(rna)[colnames(rna), ]) %>%
        ggplot +
        aes(x = UMAP.1,
            y = UMAP.2,
            colour = exprs) +
        geom_point(size = 0.3, alpha = 0.3) +
        ggtitle(marker) +
        scale_color_continuous(type = "viridis") +
        theme_minimal() +
        xlab("UMAP 1") +
        ylab("UMAP 2")
})
wrap_plots(plots)

markers <- c("SPP1", "PLAUR", "TIMP1", "MMP1", "CCL2", "CCL3", "CCL7", "FCER1G", "CTSB", "TDO2", "ELANE", "IL32", "AZU1", "RPSA", "DEK", "RBMX")
plots <- lapply(markers, function(marker) {
    data.frame(exprs = logcounts(rna)[marker, ],
               UMAP = metadata(mae)$UMAP[colnames(rna), ],
               colData(rna)[colnames(rna), ]) %>%
        ggplot +
        aes(x = UMAP.1,
            y = UMAP.2,
            colour = exprs) +
        geom_point(size = 0.3, alpha = 0.3) +
        ggtitle(marker) +
        scale_color_continuous(type = "viridis") +
        theme_minimal() +
        xlab("UMAP 1") +
        ylab("UMAP 2")
})
wrap_plots(plots)